
(ICCV 2017) StackGAN: Text to Photo-Realistic Image Synthesis with Stacked Generative Adversarial Networks

Keyword [StackGAN]

Zhang H, Xu T, Li H, et al. StackGAN: Text to photo-realistic image synthesis with stacked generative adversarial networks[C]//IEEE International Conference on Computer Vision (ICCV). 2017: 5907-5915.



1. Overview


1.1. Motivation

  • existing methods fail to generate necessary details and vivid object parts
  • GAN training is unstable
  • the limited number of training text-image pairs often results in sparsity in the text conditioning manifold, and such sparsity makes GANs difficult to train

To address these issues, the paper proposes StackGAN:

  • decompose the hard problem into more manageable sub-problems
    • Stage-I: sketches the primitive shape and colors of the object, producing a low-resolution image
    • Stage-II: refines the Stage-I result conditioned on the text description, adding details to produce a high-resolution image


  • Conditioning Augmentation (CA) technique: encourages smoothness in the latent conditioning manifold

1.2. Contribution

  • StackGAN
  • Conditioning Augmentation (CA)

1.3. Related Work

1.3.1. Generative Models

  • VAE
  • Pixel RNN
  • GAN
  • energy-based GAN

1.3.2. Conditional Image Generation

  • conditioned on variables such as attributes or class labels
  • image-to-image translation: photo editing, domain transfer, super-resolution (SR)

1.3.3. Series of GANs



2. StackGAN




2.1. Conditioning Augmentation

  • the latent space of text embeddings is usually high-dimensional; with a limited amount of data, this causes discontinuity in the latent data manifold
  • CA yields more training pairs, enforces smoothness over the conditioning manifold, and helps avoid overfitting (a sketch follows below)
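
A minimal PyTorch sketch of CA, under assumed dimensions (1024-d text embedding, 128-d conditioning vector): a conditioning vector ĉ is sampled from a Gaussian N(μ(φ_t), Σ(φ_t)) predicted from the text embedding φ_t, and a KL term regularizes this Gaussian toward N(0, I).

```python
import torch
import torch.nn as nn

class ConditioningAugmentation(nn.Module):
    """Sample a conditioning vector c_hat ~ N(mu(phi_t), diag(sigma(phi_t)^2))."""

    def __init__(self, embed_dim=1024, cond_dim=128):
        super().__init__()
        # A single linear layer predicts the mean and log-variance of the Gaussian.
        self.fc = nn.Linear(embed_dim, cond_dim * 2)
        self.cond_dim = cond_dim

    def forward(self, text_embedding):
        stats = self.fc(text_embedding)
        mu, logvar = stats[:, :self.cond_dim], stats[:, self.cond_dim:]
        std = torch.exp(0.5 * logvar)
        # Reparameterization: c_hat = mu + sigma * eps with eps ~ N(0, I).
        c_hat = mu + std * torch.randn_like(std)
        # KL divergence to N(0, I); added to the generator loss with weight lambda.
        kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
        return c_hat, kl
```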


2.2. Stage-I GAN



  • λ, the weight of the CA regularization (KL) term, is set to 1
  • I_0: the real image (the Stage-I objectives are recapped below)
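
For reference, the Stage-I discriminator and generator objectives from the paper, where φ_t is the text embedding, ĉ_0 ~ N(μ_0(φ_t), Σ_0(φ_t)) is the CA sample, and z is the noise vector:

$$\mathcal{L}_{D_0} = \mathbb{E}_{(I_0, t) \sim p_{\text{data}}}\big[\log D_0(I_0, \varphi_t)\big] + \mathbb{E}_{z \sim p_z,\, t \sim p_{\text{data}}}\big[\log\big(1 - D_0(G_0(z, \hat{c}_0), \varphi_t)\big)\big]$$

$$\mathcal{L}_{G_0} = \mathbb{E}_{z \sim p_z,\, t \sim p_{\text{data}}}\big[\log\big(1 - D_0(G_0(z, \hat{c}_0), \varphi_t)\big)\big] + \lambda\, D_{KL}\big(\mathcal{N}(\mu_0(\varphi_t), \Sigma_0(\varphi_t)) \,\|\, \mathcal{N}(0, I)\big)$$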

2.3. Stage-II GAN



  • s_0 = G_0(z, ĉ_0): the low-resolution image generated by Stage-I, on which Stage-II is conditioned (the noise z is not sampled again in Stage-II)
  • the two stages share the same pre-trained text encoder but use separate CA networks; see the inference sketch below
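
A minimal sketch of the resulting two-stage inference path; the module names (ca0, ca, stage1_g, stage2_g) are hypothetical placeholders for the two CA networks and the two generators:

```python
import torch

def generate(text_embedding, ca0, stage1_g, ca, stage2_g, z_dim=100):
    """Two-stage inference: text embedding -> low-res sketch -> high-res image."""
    batch = text_embedding.size(0)
    c_hat0, _ = ca0(text_embedding)       # Stage-I conditioning vector
    z = torch.randn(batch, z_dim)         # noise enters only at Stage-I
    s0 = stage1_g(z, c_hat0)              # low-resolution sketch (e.g. 64x64)
    c_hat, _ = ca(text_embedding)         # separate CA for Stage-II
    high_res = stage2_g(s0, c_hat)        # refined high-resolution image (e.g. 256x256)
    return s0, high_res
```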

2.4. Details

  • first train the Stage-I GAN (with the Stage-II GAN fixed)
  • then train the Stage-II GAN (with the Stage-I GAN fixed)
  • Adam (β1 = 0.5) with an initial learning rate of 0.0002, halved every 100 epochs; mini-batch size 64

  • up-sampling blocks use nearest-neighbour upsampling

  • dimension of the noise vector z: 100 (a setup sketch follows)
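
A sketch of these details in PyTorch; the 3×3 convolution, batch norm, ReLU, placeholder networks, and β2 = 0.999 are illustrative assumptions rather than the paper's exact architecture:

```python
import torch
import torch.nn as nn

# Up-sampling block built around nearest-neighbour upsampling.
def up_block(in_ch, out_ch):
    return nn.Sequential(
        nn.Upsample(scale_factor=2, mode="nearest"),
        nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=1, padding=1),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
    )

# Optimizer setup matching the listed hyper-parameters (placeholder networks).
generator = nn.Sequential(nn.Linear(100 + 128, 64 * 64 * 3))
discriminator = nn.Sequential(nn.Linear(64 * 64 * 3, 1))
g_optim = torch.optim.Adam(generator.parameters(), lr=2e-4, betas=(0.5, 0.999))
d_optim = torch.optim.Adam(discriminator.parameters(), lr=2e-4, betas=(0.5, 0.999))
# Halve the learning rate every 100 epochs.
g_sched = torch.optim.lr_scheduler.StepLR(g_optim, step_size=100, gamma=0.5)
d_sched = torch.optim.lr_scheduler.StepLR(d_optim, step_size=100, gamma=0.5)
```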



3. Experiments


3.1. Dataset

  • MSCOCO
  • CUB

3.2. Metric

  • Inception Score (defined after this list)


  1. x: a generated sample
  2. y: the label predicted by the Inception model (fine-tuned on the experiment datasets)
  • Human Evaluation
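
For reference, the Inception Score is the exponentiated expected KL divergence between the conditional label distribution p(y | x) and the marginal p(y):

$$I = \exp\big(\mathbb{E}_x\, D_{KL}\big(p(y \mid x)\,\|\,p(y)\big)\big)$$

A higher score means generated samples are both recognizable (low-entropy p(y | x)) and diverse (high-entropy p(y)).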

3.3. Comparison




  • GAN-INT-CLS: only reflects the general shape and color of the birds
  • GAWWN: fails to generate plausible images when conditioned only on text descriptions


  • the Stage-II GAN can correct defects in the Stage-I output
  • even when Stage-I fails to draw a plausible shape, Stage-II can still generate a reasonable object

3.4. Ablation Study



  • CA helps stabilize training and improves the diversity of generated samples, because it encourages robustness to small perturbations along the latent conditioning manifold

3.5. Interpolation